12/3/2020

Real data table

Real data plot

Synthetic data

Synthetic data can be generated with mice by overwriting every observed value with an imputed one, like this:

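library(mice)      # mice(), make.predictorMatrix()
library(magrittr)  # provides %>% (dplyr would also work)

# use CART for every variable, but derive bmi deterministically from wgt and hgt
cart <- rep("cart", ncol(dat))
names(cart) <- colnames(dat)
cart["bmi"] <- "~I(wgt / (hgt/100)^2)"

# do not use bmi as a predictor of wgt and hgt, since it is computed from them
pred <- make.predictorMatrix(dat)
pred[c("wgt", "hgt"), "bmi"] <- 0

# where = TRUE for every cell, so all values are replaced by synthetic ones
syns <- dat %>% mice(m = 5,
                     method = cart,
                     predictorMatrix = pred,
                     where = matrix(TRUE, nrow(dat), ncol(dat)),
                     printFlag = FALSE,
                     seed = 123)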
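
As a quick check of what such a call returns, the sketch below extracts the synthetic data sets and verifies that bmi stays consistent with wgt and hgt. It is only an illustration under stated assumptions: the complete-case subset of the `boys` data shipped with mice stands in for `dat` (the actual data are not shown here), and the names `syn_list`, `syn1` and `bmi_gap` are made up for this example.

```r
library(mice)

# Assumption: a complete-case subset of mice's built-in `boys` data stands in
# for `dat`; the document's own data are not shown.
dat <- na.omit(boys[, c("age", "hgt", "wgt", "bmi")])

# Same settings as above: CART everywhere, bmi derived passively.
cart <- rep("cart", ncol(dat))
names(cart) <- colnames(dat)
cart["bmi"] <- "~I(wgt / (hgt/100)^2)"

pred <- make.predictorMatrix(dat)
pred[c("wgt", "hgt"), "bmi"] <- 0

syns <- mice(dat, m = 5, method = cart, predictorMatrix = pred,
             where = matrix(TRUE, nrow(dat), ncol(dat)),
             printFlag = FALSE, seed = 123)

# A list with one synthetic data frame per synthesis (m = 5 in total).
syn_list <- complete(syns, action = "all")
syn1 <- syn_list[[1]]

# Largest absolute difference between the synthetic bmi and the value
# recomputed from the synthetic wgt and hgt; should be numerically zero
# if the passive step worked as intended.
bmi_gap <- max(abs(as.numeric(syn1$bmi) - syn1$wgt / (syn1$hgt / 100)^2))
bmi_gap

# Compare a few marginal summaries of observed and synthetic data.
summary(dat$age)
summary(syn1$age)
```

Each element of `syn_list` has the same dimensions as `dat`, so any analysis written for the observed data can be run on the synthetic sets unchanged.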

Real and synthetic data

Something interactive

On the right, there is a plot of the distribution of age, with the observed data in red and the synthetic data, averaged over the five synthetic data sets, in blue.

Inferences from synthetic data

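To make valid inferences from the synthetic data, the estimates from the \(m\) synthetic data sets have to be pooled with estimators that account for the synthesis step. For instance, for fully synthetic data we could use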

\[ \begin{aligned} \bar{q}_m &= \frac{1}{m} \sum^m_{i=1} q^{(i)}, \\ b_m &= \sum^m_{i=1} \frac{(q^{(i)} - \bar{q}_m)^2}{m-1}, \\ \bar{u}_m &= \frac{1}{m} \sum^m_{i=1} u^{(i)}, \\ T_f &= \left(1 + \frac{1}{m}\right) b_m - \bar{u}_m, \end{aligned} \]

with \(q^{(i)}\) the estimate obtained from synthetic data set \(i\), \(u^{(i)}\) the variance estimate of \(q^{(i)}\), \(\bar{q}_m\) the mean of the \(m\) estimates, \(b_m\) the between-synthesis variance, \(\bar{u}_m\) the average within-synthesis variance, and \(T_f\) the total variance of \(\bar{q}_m\), as proposed by Raghunathan, Reiter, & Rubin (2003).
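
The combining rules above translate almost directly into code. The following is a minimal base R sketch, not a reference implementation: the function name `pool_fully_synthetic` and the input numbers are hypothetical and only serve to illustrate the arithmetic.

```r
# Pool m point estimates q and their variance estimates u with the
# fully synthetic data rules of Raghunathan, Reiter & Rubin (2003).
pool_fully_synthetic <- function(q, u) {
  stopifnot(length(q) == length(u), length(q) > 1)
  m     <- length(q)
  q_bar <- mean(q)                      # mean of the m estimates
  b_m   <- var(q)                       # between-synthesis variance (divisor m - 1)
  u_bar <- mean(u)                      # average within-synthesis variance
  T_f   <- (1 + 1 / m) * b_m - u_bar    # total variance of q_bar
  c(q_bar = q_bar, b_m = b_m, u_bar = u_bar, T_f = T_f)
}

# Hypothetical estimates and variances from m = 5 synthetic data sets.
q <- c(10.2, 9.8, 10.5, 10.1, 9.9)
u <- c(0.04, 0.05, 0.04, 0.05, 0.04)

pooled <- pool_fully_synthetic(q, u)
pooled
```

Because \(\bar{u}_m\) is subtracted, \(T_f\) can turn out negative when the between-synthesis variance is small relative to the within-synthesis variance, so in practice the result is worth checking before taking a square root.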

References

Raghunathan, T. E., Reiter, J. P., & Rubin, D. B. (2003). Multiple imputation for statistical disclosure limitation. Journal of Official Statistics, 19(1), 1–16.